Introduction

The first official case of COVID-19 in the USA was confirmed on the 21st of January. About three months later, almost 1 million cases had been detected. In the context of this pandemic, it is of utmost importance to understand how the outbreak evolves by reporting data in a clear and insightful way.

This project has two main goals:

  • visualize different COVID-related metrics at the county level, using maps and time series: infection rate per 100k individuals, total cases, and total deaths

  • identify counties with potential errors in official counts: it has been observed that some counties report decreases in their cumulative number of cases from one day to the next, which should not be possible.

Notes about the report:

  • R code used to generate this report is provided. You just need to click on the “Code” button on the right to display the code used in a given section.

  • Most graphs and tables are interactive: you can zoom in and out, click on elements to display more content, or search for specific data points.

Data preparation

Dependencies and input files

We first need to load R packages (mostly used for interactive visualizations).

Then we set the location of input files used in this project:

  • a web-scraped data file (april15.csv) containing COVID data for a single time point

  • official population counts in US counties (census), directly taken from the web

  • the New York Times COVID-19 data, taken from the web

  • official county borders to be displayed on the map (downloaded from census.gov)

Getting and preparing NYT data

The other dataset we will look at is provided by the New York Times. We load it as a data frame containing the following variables: date, county, state, FIPS (county ID), cases, and deaths. We format the county ID in the same way as before.
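The loading and ID formatting step can be sketched as follows. This is a minimal illustration, not the code behind the report: the rows are toy values typed inline instead of the downloaded file, and it assumes that "formatting the county ID" means zero-padding FIPS codes to five characters, since leading zeros are lost when the column is read as a number. The empty FIPS in the last row is there only to show how missing codes are kept as NA.

```r
# Toy rows in the column layout described above (values are made up).
csv_text <- "date,county,state,fips,cases,deaths
2020-04-14,County A,State X,1043,10,0
2020-04-15,County A,State X,1043,12,0
2020-04-15,County B,State Y,36067,20,1
2020-04-15,County C,State Y,,100,5"

nyt <- read.csv(text = csv_text, stringsAsFactors = FALSE)
nyt$date <- as.Date(nyt$date)

# Pad FIPS codes to five characters; missing codes stay NA
# (assumption: this is the intended county ID format).
nyt$fips <- ifelse(
  is.na(nyt$fips),
  NA_character_,
  sprintf("%05d", as.integer(nyt$fips))
)
```

Storing the ID as a character string keeps it consistent across datasets, which matters when joining on it later.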

Compute rate per 100k using census data for both datasets

Both COVID datasets provide the number of cases and deaths. Using the census population of each county, we can compute two additional metrics for each dataset: the rate of cases per 100k inhabitants and the rate of deaths per 100k inhabitants.
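As a rough sketch of this computation, the rate is the count divided by the county population, multiplied by 100,000. The data frames, column names, county IDs, and population figures below are hypothetical placeholders, and `rate_per_100k` is a helper introduced only for this example.

```r
# Hypothetical counts and populations for two placeholder counties.
covid <- data.frame(
  fips   = c("00001", "00002"),
  cases  = c(12, 20),
  deaths = c(0, 1),
  stringsAsFactors = FALSE
)
census <- data.frame(
  fips       = c("00001", "00002"),
  population = c(80000, 400000),
  stringsAsFactors = FALSE
)

# Rate per 100,000 inhabitants.
rate_per_100k <- function(count, population) {
  count / population * 1e5
}

# Left join on the county ID so counties without a census match are kept (as NA).
merged <- merge(covid, census, by = "fips", all.x = TRUE)
merged$cases_per_100k  <- rate_per_100k(merged$cases, merged$population)
merged$deaths_per_100k <- rate_per_100k(merged$deaths, merged$population)
```

The same helper can be applied to the web-scraped and NYT datasets once each has been joined to the census table.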

Maps based on web-scraped and NYT data

We want to represent our COVID data on a map of the USA. To do so, we will generate maps using the Leaflet framework. The idea is to add polygons representing counties to the basemap. Polygon coloring depends on the metric of interest (here, total cases or rate /100k). Clicking on a county gives more information about this area.
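The approach described above can be illustrated with a small sketch using the leaflet and sf packages. It is not the code used for the maps in this report: the two square "counties" and their case counts are invented stand-ins for the census county borders, and the palette and styling choices are assumptions.

```r
library(sf)
library(leaflet)

# Helper (hypothetical) that builds a closed square polygon.
square <- function(x, y, size = 1) {
  st_polygon(list(rbind(
    c(x, y), c(x + size, y), c(x + size, y + size), c(x, y + size), c(x, y)
  )))
}

# Two made-up counties with made-up case counts.
counties <- st_sf(
  name  = c("County A", "County B"),
  cases = c(12, 20),
  geometry = st_sfc(square(-90, 35), square(-89, 35), crs = 4326)
)

# Map the metric of interest to a color scale.
pal <- colorNumeric(palette = "YlOrRd", domain = counties$cases)

m <- leaflet(counties) %>%
  addTiles() %>%                                   # basemap
  addPolygons(
    fillColor = ~pal(cases), fillOpacity = 0.7,    # color depends on the metric
    color = "white", weight = 1,
    popup = ~paste0(name, ": ", cases, " cases")   # shown when clicking a county
  ) %>%
  addLegend(pal = pal, values = ~cases, title = "Total cases")
```

Switching the map to another metric, such as the rate per 100k, only requires changing the column passed to the palette, the fill color, and the legend.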

Total number of cases (April 15 data)

First, we represent the total number of cases in each county.

Number of cases per 100,000 (April 15 data)

Next, we can represent the rate of cases per 100,000 individuals.

Number of cases per 100,000 (NYT data, last day available)

Finally, we can represent the rate of cases per 100,000 individuals, with NYT data this time (by first taking the most recent timepoint available in the dataset).

Evolution of COVID cases over time (New York Times data)

We can now look at how the COVID data change over time. To do so, we plot the evolution of COVID cases in the 50 most affected counties, defined as the counties with the highest number of cases so far.
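One way to select the most affected counties is sketched below, assuming that "cases so far" is the cumulative count on the most recent date in the dataset. The data are toy values, `top_counties` is a hypothetical helper, and `n = 2` stands in for the 50 counties used in the report. Note that with this approach a county with no record on the latest date would be left out.

```r
# Toy cumulative counts for three placeholder counties over two days.
nyt <- data.frame(
  date  = as.Date(c("2020-04-14", "2020-04-15", "2020-04-14",
                    "2020-04-15", "2020-04-14", "2020-04-15")),
  fips  = c("00001", "00001", "00002", "00002", "00003", "00003"),
  cases = c(10, 12, 30, 40, 5, 6),
  stringsAsFactors = FALSE
)

# Return the IDs of the n counties with the most cases on the latest date.
top_counties <- function(data, n = 50) {
  latest <- data[data$date == max(data$date), ]
  latest <- latest[order(latest$cases, decreasing = TRUE), ]
  head(latest$fips, n)
}

top <- top_counties(nyt, n = 2)

# Keep the full time series of the selected counties for plotting.
top_series <- nyt[nyt$fips %in% top, ]
```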

Total cases over time

We can first represent the evolution of the total number of cases in the most affected counties. A subset of these 50 curves can be selected using the boxes on the left (filtering by states or counties).

Rate of infection over time

We can now represent the evolution of the rate of cases per 100,000 individuals in these counties.

Identifying counties with unexpected patterns

It has been observed that some data might be erroneous, as the cumulative number of cases sometimes goes down, which is not supposed to happen. We need to identify the reasons underlying these observations. To do so, we first automatically identify the counties that report a negative difference in the cumulative number of cases from one day to the next.

List of counties with negative differences

The idea now is to compute, for each county, the difference in the number of cases from one day to the next in order to identify negative differences (a cumulative number of cases going down). We can then report the counties presenting such negative differences in a table.
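The day-to-day difference can be sketched in base R as follows. This is an illustration on toy values, not the report's own code: the column names are assumptions, and the threshold of 10 is used here only to show how larger drops could be singled out.

```r
# Toy cumulative counts; the second county drops from 30 to 18.
nyt <- data.frame(
  date  = as.Date(rep(c("2020-04-13", "2020-04-14", "2020-04-15"), times = 2)),
  fips  = rep(c("00001", "00002"), each = 3),
  cases = c(10, 12, 15, 30, 18, 20),
  stringsAsFactors = FALSE
)

# Differences are only meaningful within a county, in chronological order.
nyt <- nyt[order(nyt$fips, nyt$date), ]
nyt$daily_diff <- ave(nyt$cases, nyt$fips, FUN = function(x) c(NA, diff(x)))

# Rows where the cumulative count went down.
negative <- nyt[!is.na(nyt$daily_diff) & nyt$daily_diff < 0, ]

# Counties with a drop of more than 10 cases in a single day.
large_drops <- unique(negative$fips[negative$daily_diff < -10])
```

The first record of each county has no previous day to compare with, so its difference is NA and is excluded from the table.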

Time series for counties with largest discrepancies

We can look at the time series of some counties presenting a large negative difference in the number of cases (a drop of more than 10 cases).

##  [1] "Cullman 01043"     "Onondaga 36067"    "Tazewell 17179"   
##  [4] "Dougherty 13095"   "Carson City 32510" "Ripley 18137"     
##  [7] "Oakland 26125"     "Madison 01089"     "Lafayette 22055"  
## [10] "Rensselaer 36083"  "St. Charles 22089" "Tuscaloosa 01125" 
## [13] "Lexington 45063"   "St. Landry 22097"